{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# \ud83d\udcc3 Solution for Exercise M1.05\n",
    "\n",
    "The goal of this exercise is to evaluate the impact of feature preprocessing\n",
    "on a pipeline that uses a decision-tree-based classifier instead of a logistic\n",
    "regression.\n",
    "\n",
    "- The first question is to empirically evaluate whether scaling numerical\n",
    "  features is helpful or not;\n",
    "- The second question is to evaluate whether it is empirically better (both\n",
    "  from a computational and a statistical perspective) to use integer coded or\n",
    "  one-hot encoded categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "adult_census = pd.read_csv(\"../datasets/adult-census.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "target_name = \"class\"\n",
    "target = adult_census[target_name]\n",
    "data = adult_census.drop(columns=[target_name, \"education-num\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As in the previous notebooks, we use the utility `make_column_selector` to\n",
    "select only columns with a specific data type. Besides, we list in advance all\n",
    "categories for the categorical columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_selector as selector\n",
    "\n",
    "numerical_columns_selector = selector(dtype_exclude=object)\n",
    "categorical_columns_selector = selector(dtype_include=object)\n",
    "numerical_columns = numerical_columns_selector(data)\n",
    "categorical_columns = categorical_columns_selector(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference pipeline (no numerical scaling and integer-coded categories)\n",
    "\n",
    "First let's time the pipeline we used in the main notebook to serve as a\n",
    "reference:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "\n",
    "from sklearn.model_selection import cross_validate\n",
    "from sklearn.pipeline import make_pipeline\n",
    "from sklearn.compose import make_column_transformer\n",
    "from sklearn.preprocessing import OrdinalEncoder\n",
    "from sklearn.ensemble import HistGradientBoostingClassifier\n",
    "\n",
    "categorical_preprocessor = OrdinalEncoder(\n",
    "    handle_unknown=\"use_encoded_value\", unknown_value=-1\n",
    ")\n",
    "preprocessor = make_column_transformer(\n",
    "    (categorical_preprocessor, categorical_columns),\n",
    "    remainder=\"passthrough\",\n",
    ")\n",
    "\n",
    "\n",
    "model = make_pipeline(preprocessor, HistGradientBoostingClassifier())\n",
    "\n",
    "start = time.time()\n",
    "cv_results = cross_validate(model, data, target)\n",
    "elapsed_time = time.time() - start\n",
    "\n",
    "scores = cv_results[\"test_score\"]\n",
    "\n",
    "print(\n",
    "    \"The mean cross-validation accuracy is: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f} \"\n",
    "    f\"with a fitting time of {elapsed_time:.3f} seconds\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Scaling numerical features\n",
    "\n",
    "Let's write a similar pipeline that also scales the numerical features using\n",
    "`StandardScaler` (or similar):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import time\n",
    "\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "preprocessor = make_column_transformer(\n",
    "    (StandardScaler(), numerical_columns),\n",
    "    (\n",
    "        OrdinalEncoder(handle_unknown=\"use_encoded_value\", unknown_value=-1),\n",
    "        categorical_columns,\n",
    "    ),\n",
    ")\n",
    "\n",
    "model = make_pipeline(preprocessor, HistGradientBoostingClassifier())\n",
    "\n",
    "start = time.time()\n",
    "cv_results = cross_validate(model, data, target)\n",
    "elapsed_time = time.time() - start\n",
    "\n",
    "scores = cv_results[\"test_score\"]\n",
    "\n",
    "print(\n",
    "    \"The mean cross-validation accuracy is: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f} \"\n",
    "    f\"with a fitting time of {elapsed_time:.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "### Analysis\n",
    "\n",
    "We can observe that both the accuracy and the training time are approximately\n",
    "the same as the reference pipeline (any time difference you might observe is\n",
    "not significant).\n",
    "\n",
    "Scaling numerical features is indeed useless for most decision tree models in\n",
    "general and for `HistGradientBoostingClassifier` in particular."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## One-hot encoding of categorical variables\n",
    "\n",
    "We observed that integer coding of categorical variables can be very\n",
    "detrimental for linear models. However, it does not seem to be the case for\n",
    "`HistGradientBoostingClassifier` models, as the cross-validation score of the\n",
    "reference pipeline with `OrdinalEncoder` is reasonably good.\n",
    "\n",
    "Let's see if we can get an even better accuracy with `OneHotEncoder`.\n",
    "\n",
    "Hint: `HistGradientBoostingClassifier` does not yet support sparse input data.\n",
    "You might want to use `OneHotEncoder(handle_unknown=\"ignore\",\n",
    "sparse_output=False)` to force the use of a dense representation as a\n",
    "workaround."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# solution\n",
    "import time\n",
    "\n",
    "from sklearn.preprocessing import OneHotEncoder\n",
    "\n",
    "categorical_preprocessor = OneHotEncoder(\n",
    "    handle_unknown=\"ignore\", sparse_output=False\n",
    ")\n",
    "preprocessor = make_column_transformer(\n",
    "    (categorical_preprocessor, categorical_columns),\n",
    "    remainder=\"passthrough\",\n",
    ")\n",
    "\n",
    "model = make_pipeline(preprocessor, HistGradientBoostingClassifier())\n",
    "\n",
    "start = time.time()\n",
    "cv_results = cross_validate(model, data, target)\n",
    "elapsed_time = time.time() - start\n",
    "\n",
    "scores = cv_results[\"test_score\"]\n",
    "\n",
    "print(\n",
    "    \"The mean cross-validation accuracy is: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f} \"\n",
    "    f\"with a fitting time of {elapsed_time:.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "solution"
    ]
   },
   "source": [
    "### Analysis\n",
    "\n",
    "From an accuracy point of view, the result is almost exactly the same. The\n",
    "reason is that `HistGradientBoostingClassifier` is expressive and robust\n",
    "enough to deal with misleading ordering of integer coded categories (which was\n",
    "not the case for linear models).\n",
    "\n",
    "However from a computation point of view, the training time is much longer:\n",
    "this is caused by the fact that `OneHotEncoder` generates more features than\n",
    "`OrdinalEncoder`; for each unique categorical value a column is created.\n",
    "\n",
    "Note that the current implementation `HistGradientBoostingClassifier` is still\n",
    "incomplete, and once sparse representation are handled correctly, training\n",
    "time might improve with such kinds of encodings.\n",
    "\n",
    "The main take away message is that arbitrary integer coding of categories is\n",
    "perfectly fine for `HistGradientBoostingClassifier` and yields fast training\n",
    "times."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Which encoder should I use?\n",
    "\n",
    "|                  | Meaningful order              | Non-meaningful order |\n",
    "| ---------------- | ----------------------------- | -------------------- |\n",
    "| Tree-based model | `OrdinalEncoder`              | `OrdinalEncoder` with reasonable depth    |\n",
    "| Linear model     | `OrdinalEncoder` with caution | `OneHotEncoder`      |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lines_to_next_cell": 0
   },
   "source": [
    "<div class=\"admonition important alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Important</p>\n",
    "<ul class=\"last simple\">\n",
    "<li><tt class=\"docutils literal\">OneHotEncoder</tt>: always does something meaningful, but can be unnecessary\n",
    "slow with trees.</li>\n",
    "<li><tt class=\"docutils literal\">OrdinalEncoder</tt>: can be detrimental for linear models unless your category\n",
    "has a meaningful order and you make sure that <tt class=\"docutils literal\">OrdinalEncoder</tt> respects this\n",
    "order. Trees can deal with <tt class=\"docutils literal\">OrdinalEncoder</tt> fine as long as they are deep\n",
    "enough. However, when you allow the decision tree to grow very deep, it might\n",
    "overfit on other features.</li>\n",
    "</ul>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next to one-hot-encoding and ordinal encoding categorical features,\n",
    "scikit-learn offers the [`TargetEncoder`](https://scikit-learn.org/stable/modules/preprocessing.html#target-encoder).\n",
    "This encoder is well suited for nominal, categorical features with high\n",
    "cardinality. This encoding strategy is beyond the scope of this course,\n",
    "but the interested reader is encouraged to explore this encoder."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}